feat: Expand benchmark, update params #24
Merged
Remove tasks where target files moved to external packages in newer versions (express v5 router/middleware, chi cors/redirect, ecto migration, phoenix template, rack session). Fix paths for jackson-databind BeanDeserializer, kotlinx-coroutines CoroutineContext, nlohmann-json json_pointer, and circe DecodingFailure.
- NL alpha 0.6 -> 0.5: equal weight semantic + BM25 (BM25 finds targets 2.3x more often than semantic among failure queries)
- Stem boost multiplier 0.5 -> 1.0: stronger file-path keyword signal
- Match ratio threshold 0.20 -> 0.10: boost files when any keyword matches, even for longer queries

NDCG@10 on 50-repo benchmark: 0.838 -> 0.851 (+0.013)
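A minimal sketch of how the three parameters above might interact, assuming alpha weights the semantic score and (1 - alpha) weights BM25, and assuming the stem boost scales with the fraction of query keywords found in the file path. The function names `blend_scores` and `stem_boost` and the boost formula are hypothetical illustrations, not the project's actual code.

```python
def blend_scores(semantic: float, bm25: float, alpha: float = 0.5) -> float:
    """Linear blend of two normalized scores (assumed form).

    alpha = 0.5 gives semantic and BM25 equal weight.
    """
    return alpha * semantic + (1.0 - alpha) * bm25


def stem_boost(
    query_keywords: list[str],
    path_tokens: set[str],
    multiplier: float = 1.0,
    min_match_ratio: float = 0.10,
) -> float:
    """Hypothetical file-path boost.

    The boost applies only when the share of query keywords that appear
    in the file path reaches min_match_ratio.
    """
    if not query_keywords:
        return 0.0
    matched = sum(1 for kw in query_keywords if kw in path_tokens)
    ratio = matched / len(query_keywords)
    return multiplier * ratio if ratio >= min_match_ratio else 0.0
```

Under these assumptions, an 8-keyword query with a single path match has a ratio of 0.125, so it is boosted at the new 0.10 threshold but would have been ignored at 0.20.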
Add semantic/architecture/symbol categories to 212 tasks across 14 repos that were missing them. Add 11 new express tasks to restore coverage after broken annotations were removed (9 -> 20 tasks). Total: 930 tasks across 48 repos, all categorized.
- commons-lang: reflectionEquals span 89-99 -> 179-318 (class header is not the reflection logic)
- circe: auto/semiauto derivation target was Decoder.scala (wrong file), now points to generic/auto.scala + semiauto.scala
- exposed: SchemaUtils target was abstract SchemaUtilityApi.kt, now points to the concrete SchemaUtils.kt in exposed-jdbc
- sinatra: halt/pass/redirect span too narrow, use whole-file
- sinatra: Rack build() method span was setup_default_middleware helper, now points to the actual build() method at line 1670
- sinatra: Helpers symbol span extended to cover halt (1028) and pass (1036)
guzzle +5, ktor +4, sinatra +4, messagepack-csharp +3, alamofire +3, tokio +3, trpc +3, cats +3. All repos now have >= 20 tasks. Total: 954 tasks across 48 repos.
- Add curl, redis, bats-core, aeson, http-dart, telescope.nvim, lazy.nvim, zig
- 160 new annotation tasks (20 per repo)
- Add .bash, .zig, .hs file extensions to file_walker
- Overall NDCG@10: 0.841 across 56 repos
…mean-of-language-means
- Add 10 new repos: nvm, bash-it (replaces gitflow-avh), pandoc, xmonad, dio, riverpod, nvim-lspconfig, mini.nvim, zls, zig-clap
- Bring bash, haskell, dart, lua, zig all to 3+ repos
- Fix run_benchmark.py aggregation: headline NDCG@10 is now mean of per-language means (one vote per language, not per repo), which previously over-weighted Python's 9 repos
- Fix numpy float type annotation issue (float() cast on np.median)
- New headline: NDCG@10 = 0.829 across 20 languages (66 repos)
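The aggregation change can be sketched as follows. This is an illustration of the mean-of-per-language-means idea only; the function name `headline_ndcg` and the `(language, score)` input shape are assumptions, not the actual structure of run_benchmark.py.

```python
from collections import defaultdict
from statistics import mean


def headline_ndcg(per_repo: list[tuple[str, float]]) -> float:
    """Mean of per-language means: each language gets one vote.

    per_repo holds (language, repo-level NDCG@10) pairs.
    """
    by_language: dict[str, list[float]] = defaultdict(list)
    for language, score in per_repo:
        by_language[language].append(score)
    # Average within each language first, then across languages.
    return mean([mean(scores) for scores in by_language.values()])


def per_repo_mean(per_repo: list[tuple[str, float]]) -> float:
    """Previous behaviour as described above: one vote per repo."""
    return mean([score for _, score in per_repo])
```

With three Python repos at 0.9 and one Zig repo at 0.5, the per-repo mean is about 0.8 while the per-language mean is about 0.7, which shows how a language with many repos can dominate the per-repo figure.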
… annotation audit
- Fix n_relevant to use annotation count instead of index coverage (reviewer #5)
- Add per-category NDCG@10 to printed summary and saved JSON (reviewer #7)
- Replace 11 trivially-lexical semantic queries with vocabulary-diverse alternatives
- Baseline: NDCG@10 = 0.825 (architecture=0.773, semantic=0.823, symbol=0.943)
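For reference, a minimal NDCG@10 sketch showing why the n_relevant fix matters. It assumes binary relevance and file-level matching, which may differ from the benchmark's actual scoring; `ndcg_at_k` is a hypothetical name.

```python
import math


def ndcg_at_k(ranked_files: list[str], annotated_relevant: list[str], k: int = 10) -> float:
    """NDCG@k with binary relevance (assumed).

    The ideal DCG is built from the number of annotated relevant files,
    not from how many of them happen to be present in the index.
    """
    relevant = set(annotated_relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, path in enumerate(ranked_files[:k])
        if path in relevant
    )
    n_relevant = len(relevant)  # annotation count
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(n_relevant, k)))
    return dcg / ideal if ideal > 0 else 0.0
```

Under this definition, a query with two annotated targets of which only one is retrievable cannot score 1.0, so missing index coverage lowers the score instead of being silently forgiven.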
…ent-scoped one
- ktor: the server application query targeted files outside the benchmark_root; replaced with a client-side plugin pipeline query that indexes correctly
- rxswift: Observable.swift is a thin declaration file; corrected relevant target to ObservableType.swift which contains the actual protocol definition
- Swift +0.006, Kotlin +0.004, architecture category +0.002
- sinatra: fix 3 queries pointing to wrong/narrow line ranges in base.rb
- circe: replace out-of-scope generic derivation query (targets modules/generic/ which is outside benchmark_root) with DecodingFailure/ParsingFailure query targeting Error.scala in core
- cats: replace Semigroup/Monoid query pointing to kernel/ module (outside root) with MonoidK/SemigroupK query targeting core
- rxswift: add Zip+arity.swift as second relevant for zip operator query
- exposed: add Transactions.kt as second relevant for transaction block query

NDCG@10: 0.825 (baseline) -> 0.830
Remove outdated result files from previous benchmark runs and add fresh result from current HEAD (NDCG@10=0.830).
- Remove nvim-lspconfig (4th lua repo, lowest score 0.583) to keep all languages at 3 repos
- Fix bash-it and libuv annotations using non-standard 'api' and 'keyword' categories; remap to 'architecture' and 'symbol'
- Refresh benchmark results: NDCG@10 = 0.833
The PR expands the benchmark from 29 repos (12 languages) to 66 repos (20 languages), for a total of 1318 queries. The main metric is also changed to the mean of per-language means, to give a more balanced view of how well semble works across languages.
This means the old benchmark scores are no longer valid. A few params are also tuned.
Current dataset overview: